Analysis and visualization of biological data
Nuanced data manipulation and visualization; creating report files
In this part we’ll use some pre-loaded data tables already present in R, then utilize the R-packages tidyr, dplyr, and ggplot2, which are part of the Tidyverse, which is a set of R-packages intended for data science applications, with the same underlying design philosophy, grammar, and data structures.
We’ll go through some basic data-overview methods. Then some of the more important functions that help us to handle our data efficiently. Next, we’ll get acquainted with how we can produce nice figures that best visualize our data. Finally, we’ll put all these into human-readable, relatively aesthetic report files.
Throughout this session, we’ll use the below data tables:
- Theoph: pharmacokinetics of theophylline
- esoph: data on the incidence of esophagial cancer in association with alcohol and tobacco consumption
- iris: sepal length and width, and petal length and width, for 50 flowers from each of 3 species of iris: Iris setosa, I. versicolor, and I. virginica
- Indometh: pharmacokinetics of indometacin
Further information can be obtained about these by using the help function:
# with the "help()" function:
help(esoph)
# or alternatively:
?esophWe’ll use the following R-packages:
# main
require(magrittr)
require(tidyr)
require(dplyr)
require(ggplot2)
require(patchwork)
require(knitr)
require(kableExtra)
require(DT)
# optional
require(GGally)
require(ggcorrplot)
require(ggpubr)1 Handling data
1.1 Overview of data
Data are, in their basic forms, data frames. To get more information about our data even when we’re just using simple functions such as “head()”, we can convert them into “tibble” class. (This is not crucially important: data frames are just fine, but for now, let’s be fancy and use the data frames as tibbles.)
Theoph = as_tibble(Theoph)
esoph = as_tibble(esoph)
iris = as_tibble(iris)
# Theoph
summary(Theoph)## Subject Wt Dose Time conc
## 6 :11 Min. :54.60 Min. :3.100 Min. : 0.000 Min. : 0.000
## 7 :11 1st Qu.:63.58 1st Qu.:4.305 1st Qu.: 0.595 1st Qu.: 2.877
## 8 :11 Median :70.50 Median :4.530 Median : 3.530 Median : 5.275
## 11 :11 Mean :69.58 Mean :4.626 Mean : 5.895 Mean : 4.960
## 3 :11 3rd Qu.:74.42 3rd Qu.:5.037 3rd Qu.: 9.000 3rd Qu.: 7.140
## 2 :11 Max. :86.40 Max. :5.860 Max. :24.650 Max. :11.400
## (Other):66
str(Theoph)## tibble [132 × 5] (S3: tbl_df/tbl/data.frame)
## $ Subject: Ord.factor w/ 12 levels "6"<"7"<"8"<"11"<..: 11 11 11 11 11 11 11 11 11 11 ...
## $ Wt : num [1:132] 79.6 79.6 79.6 79.6 79.6 79.6 79.6 79.6 79.6 79.6 ...
## $ Dose : num [1:132] 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 ...
## $ Time : num [1:132] 0 0.25 0.57 1.12 2.02 ...
## $ conc : num [1:132] 0.74 2.84 6.57 10.5 9.66 8.58 8.36 7.47 6.89 5.94 ...
## - attr(*, "formula")=Class 'formula' language conc ~ Time | Subject
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "labels")=List of 2
## ..$ x: chr "Time since drug administration"
## ..$ y: chr "Theophylline concentration in serum"
## - attr(*, "units")=List of 2
## ..$ x: chr "(hr)"
## ..$ y: chr "(mg/l)"
head(Theoph)## # A tibble: 6 × 5
## Subject Wt Dose Time conc
## <ord> <dbl> <dbl> <dbl> <dbl>
## 1 1 79.6 4.02 0 0.74
## 2 1 79.6 4.02 0.25 2.84
## 3 1 79.6 4.02 0.57 6.57
## 4 1 79.6 4.02 1.12 10.5
## 5 1 79.6 4.02 2.02 9.66
## 6 1 79.6 4.02 3.82 8.58
# esoph
summary(esoph)## agegp alcgp tobgp ncases ncontrols
## 25-34:15 0-39g/day:23 0-9g/day:24 Min. : 0.000 Min. : 0.000
## 35-44:15 40-79 :23 10-19 :24 1st Qu.: 0.000 1st Qu.: 1.000
## 45-54:16 80-119 :21 20-29 :20 Median : 1.000 Median : 4.000
## 55-64:16 120+ :21 30+ :20 Mean : 2.273 Mean : 8.807
## 65-74:15 3rd Qu.: 4.000 3rd Qu.:10.000
## 75+ :11 Max. :17.000 Max. :60.000
str(esoph)## tibble [88 × 5] (S3: tbl_df/tbl/data.frame)
## $ agegp : Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ alcgp : Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 1 1 1 2 2 2 2 3 3 ...
## $ tobgp : Ord.factor w/ 4 levels "0-9g/day"<"10-19"<..: 1 2 3 4 1 2 3 4 1 2 ...
## $ ncases : num [1:88] 0 0 0 0 0 0 0 0 0 0 ...
## $ ncontrols: num [1:88] 40 10 6 5 27 7 4 7 2 1 ...
head(esoph)## # A tibble: 6 × 5
## agegp alcgp tobgp ncases ncontrols
## <ord> <ord> <ord> <dbl> <dbl>
## 1 25-34 0-39g/day 0-9g/day 0 40
## 2 25-34 0-39g/day 10-19 0 10
## 3 25-34 0-39g/day 20-29 0 6
## 4 25-34 0-39g/day 30+ 0 5
## 5 25-34 40-79 0-9g/day 0 27
## 6 25-34 40-79 10-19 0 7
# iris
summary(iris)## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
str(iris)## tibble [150 × 5] (S3: tbl_df/tbl/data.frame)
## $ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris)## # A tibble: 6 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
1.2 Manipulating data
1.2.1 pipe
%>%: “Pipe an object forward into a function or call expression.”
iris %>% head## # A tibble: 6 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
1.2.2 select
?select: “Select (and optionally rename) variables in a data frame, using a concise mini-language that makes it easy to refer to variables based on their name (e.g. a:f selects all columns from a on the left to f on the right) or type (e.g. where(is.numeric) selects all numeric columns).”
# look at the first few rows of selected variables with "head()"
head(select(Theoph, Time, conc))## # A tibble: 6 × 2
## Time conc
## <dbl> <dbl>
## 1 0 0.74
## 2 0.25 2.84
## 3 0.57 6.57
## 4 1.12 10.5
## 5 2.02 9.66
## 6 3.82 8.58
# alternatively, with pipe:
Theoph %>% select(Time, conc) %>% head## # A tibble: 6 × 2
## Time conc
## <dbl> <dbl>
## 1 0 0.74
## 2 0.25 2.84
## 3 0.57 6.57
## 4 1.12 10.5
## 5 2.02 9.66
## 6 3.82 8.58
# or, in a more "super-fancy-professional" way:
Theoph %>%
select(Time, conc) %>%
head## # A tibble: 6 × 2
## Time conc
## <dbl> <dbl>
## 1 0 0.74
## 2 0.25 2.84
## 3 0.57 6.57
## 4 1.12 10.5
## 5 2.02 9.66
## 6 3.82 8.58
Note that this is equivalent to:
head(Theoph[,c("Time", "conc")])## # A tibble: 6 × 2
## Time conc
## <dbl> <dbl>
## 1 0 0.74
## 2 0.25 2.84
## 3 0.57 6.57
## 4 1.12 10.5
## 5 2.02 9.66
## 6 3.82 8.58
# or
Theoph[,c("Time", "conc")] %>%
head## # A tibble: 6 × 2
## Time conc
## <dbl> <dbl>
## 1 0 0.74
## 2 0.25 2.84
## 3 0.57 6.57
## 4 1.12 10.5
## 5 2.02 9.66
## 6 3.82 8.58
Exclude specific variables (columns):
Theoph %>%
select(-Subject) %>%
head## # A tibble: 6 × 4
## Wt Dose Time conc
## <dbl> <dbl> <dbl> <dbl>
## 1 79.6 4.02 0 0.74
## 2 79.6 4.02 0.25 2.84
## 3 79.6 4.02 0.57 6.57
## 4 79.6 4.02 1.12 10.5
## 5 79.6 4.02 2.02 9.66
## 6 79.6 4.02 3.82 8.58
1.2.3 filter
?filter: “The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions. To be retained, the row must produce a value of TRUE for all conditions. Note that when a condition evaluates to NA the row will be dropped, unlike base subsetting with [.”
Theoph %>%
filter(Subject == 2) %>%
head## # A tibble: 6 × 5
## Subject Wt Dose Time conc
## <ord> <dbl> <dbl> <dbl> <dbl>
## 1 2 72.4 4.4 0 0
## 2 2 72.4 4.4 0.27 1.72
## 3 2 72.4 4.4 0.52 7.91
## 4 2 72.4 4.4 1 8.31
## 5 2 72.4 4.4 1.92 8.33
## 6 2 72.4 4.4 3.5 6.85
Theoph %>%
filter(Subject != 2, !is.na(Time)) %>%
head## # A tibble: 6 × 5
## Subject Wt Dose Time conc
## <ord> <dbl> <dbl> <dbl> <dbl>
## 1 1 79.6 4.02 0 0.74
## 2 1 79.6 4.02 0.25 2.84
## 3 1 79.6 4.02 0.57 6.57
## 4 1 79.6 4.02 1.12 10.5
## 5 1 79.6 4.02 2.02 9.66
## 6 1 79.6 4.02 3.82 8.58
1.2.4 mutate
?mutate: “mutate() creates new columns that are functions of existing variables. It can also modify (if the name is the same as an existing column) and delete columns (by setting their value to NULL).”
Theoph %>%
mutate(mins = Time*60 ) %>%
head## # A tibble: 6 × 6
## Subject Wt Dose Time conc mins
## <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 79.6 4.02 0 0.74 0
## 2 1 79.6 4.02 0.25 2.84 15
## 3 1 79.6 4.02 0.57 6.57 34.2
## 4 1 79.6 4.02 1.12 10.5 67.2
## 5 1 79.6 4.02 2.02 9.66 121.
## 6 1 79.6 4.02 3.82 8.58 229.
esoph %>%
mutate(case.rate = ncases/(ncases+ncontrols) ) %>%
head## # A tibble: 6 × 6
## agegp alcgp tobgp ncases ncontrols case.rate
## <ord> <ord> <ord> <dbl> <dbl> <dbl>
## 1 25-34 0-39g/day 0-9g/day 0 40 0
## 2 25-34 0-39g/day 10-19 0 10 0
## 3 25-34 0-39g/day 20-29 0 6 0
## 4 25-34 0-39g/day 30+ 0 5 0
## 5 25-34 40-79 0-9g/day 0 27 0
## 6 25-34 40-79 10-19 0 7 0
1.2.5 group_by & summarize
esoph %>%
group_by(agegp) %>%
summarize(mean(ncases))## # A tibble: 6 × 2
## agegp `mean(ncases)`
## <ord> <dbl>
## 1 25-34 0.0667
## 2 35-44 0.6
## 3 45-54 2.88
## 4 55-64 4.75
## 5 65-74 3.67
## 6 75+ 1.18
esoph %>%
group_by(alcgp, tobgp) %>%
summarize(mean(ncases))## `summarise()` has grouped output by 'alcgp'. You can override using the
## `.groups` argument.
## # A tibble: 16 × 3
## # Groups: alcgp [4]
## alcgp tobgp `mean(ncases)`
## <ord> <ord> <dbl>
## 1 0-39g/day 0-9g/day 1.5
## 2 0-39g/day 10-19 1.67
## 3 0-39g/day 20-29 1
## 4 0-39g/day 30+ 0.833
## 5 40-79 0-9g/day 5.67
## 6 40-79 10-19 2.83
## 7 40-79 20-29 2.5
## 8 40-79 30+ 1.8
## 9 80-119 0-9g/day 3.17
## 10 80-119 10-19 3.17
## 11 80-119 20-29 1.5
## 12 80-119 30+ 1.4
## 13 120+ 0-9g/day 2.67
## 14 120+ 10-19 2
## 15 120+ 20-29 1.4
## 16 120+ 30+ 2.5
1.2.6 arrange
?arrange: “arrange() orders the rows of a data frame by the values of selected columns. Unlike other dplyr verbs, arrange() largely ignores grouping; you need to explicitly mention grouping variables (or use .by_group = TRUE) in order to group by them, and functions of variables are evaluated once per data frame, not once per group.”
# arranging in increasing order
esoph %>%
group_by(agegp, tobgp) %>%
summarize(mean.cases=mean(ncases)) %>%
arrange(-mean.cases)## `summarise()` has grouped output by 'agegp'. You can override using the
## `.groups` argument.
## # A tibble: 24 × 3
## # Groups: agegp [6]
## agegp tobgp mean.cases
## <ord> <ord> <dbl>
## 1 65-74 0-9g/day 7.75
## 2 55-64 0-9g/day 6.25
## 3 55-64 10-19 5.75
## 4 55-64 30+ 4
## 5 45-54 0-9g/day 3.5
## 6 45-54 10-19 3.25
## 7 55-64 20-29 3
## 8 65-74 10-19 3
## 9 45-54 30+ 2.75
## 10 65-74 20-29 2.5
## # … with 14 more rows
# arranging in decreasing order
esoph %>%
group_by(alcgp, tobgp) %>%
summarize(mean.cases=mean(ncases)) %>%
arrange(-mean.cases)## `summarise()` has grouped output by 'alcgp'. You can override using the
## `.groups` argument.
## # A tibble: 16 × 3
## # Groups: alcgp [4]
## alcgp tobgp mean.cases
## <ord> <ord> <dbl>
## 1 40-79 0-9g/day 5.67
## 2 80-119 0-9g/day 3.17
## 3 80-119 10-19 3.17
## 4 40-79 10-19 2.83
## 5 120+ 0-9g/day 2.67
## 6 40-79 20-29 2.5
## 7 120+ 30+ 2.5
## 8 120+ 10-19 2
## 9 40-79 30+ 1.8
## 10 0-39g/day 10-19 1.67
## 11 0-39g/day 0-9g/day 1.5
## 12 80-119 20-29 1.5
## 13 80-119 30+ 1.4
## 14 120+ 20-29 1.4
## 15 0-39g/day 20-29 1
## 16 0-39g/day 30+ 0.833
1.2.7 count
Simple counting of levels of categorical variables:
iris %>%
count(Species)## # A tibble: 3 × 2
## Species n
## <fct> <int>
## 1 setosa 50
## 2 versicolor 50
## 3 virginica 50
# alternatively:
table(iris$Species)##
## setosa versicolor virginica
## 50 50 50
# or
xtabs(~Species, data = iris)## Species
## setosa versicolor virginica
## 50 50 50
Cross tabulation (a.k.a. contingency table, i.e. checking the number of observations corresponding to levels from two variables):
esoph %>%
count(agegp, alcgp)## # A tibble: 24 × 3
## agegp alcgp n
## <ord> <ord> <int>
## 1 25-34 0-39g/day 4
## 2 25-34 40-79 4
## 3 25-34 80-119 3
## 4 25-34 120+ 4
## 5 35-44 0-39g/day 4
## 6 35-44 40-79 4
## 7 35-44 80-119 4
## 8 35-44 120+ 3
## 9 45-54 0-39g/day 4
## 10 45-54 40-79 4
## # … with 14 more rows
# alternatively:
table(esoph$agegp, esoph$alcgp)##
## 0-39g/day 40-79 80-119 120+
## 25-34 4 4 3 4
## 35-44 4 4 4 3
## 45-54 4 4 4 4
## 55-64 4 4 4 4
## 65-74 4 3 4 4
## 75+ 3 4 2 2
# or
xtabs(~agegp+alcgp, data = esoph)## alcgp
## agegp 0-39g/day 40-79 80-119 120+
## 25-34 4 4 3 4
## 35-44 4 4 4 3
## 45-54 4 4 4 4
## 55-64 4 4 4 4
## 65-74 4 3 4 4
## 75+ 3 4 2 2
1.2.8 reshaping data frames
In general, data tables can either be in a “long” or a “wide” format, which refers to how observations are recorded in the table. In the long format the repeated measurements are in separate rows and the corresponding name is coded as another variable; in the wide format, repeated measurements are in separate vairables (columns) but in the same row. Some analysis or visualization methods require specific data structures, for which it often may be necessary to reshape our data frames.
table formats
Wide to long:
head(iris)## # A tibble: 6 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
iris.long = iris %>%
pivot_longer(
cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
names_to = "flower.part", values_to = "value")
head(iris.long)## # A tibble: 6 × 3
## Species flower.part value
## <fct> <chr> <dbl>
## 1 setosa Sepal.Length 5.1
## 2 setosa Sepal.Width 3.5
## 3 setosa Petal.Length 1.4
## 4 setosa Petal.Width 0.2
## 5 setosa Sepal.Length 4.9
## 6 setosa Sepal.Width 3
# alternatively
iris.long = iris %>%
pivot_longer(
cols = c(Sepal.Length:Petal.Width),
names_to = "flower.part", values_to = "value")
head(iris.long)## # A tibble: 6 × 3
## Species flower.part value
## <fct> <chr> <dbl>
## 1 setosa Sepal.Length 5.1
## 2 setosa Sepal.Width 3.5
## 3 setosa Petal.Length 1.4
## 4 setosa Petal.Width 0.2
## 5 setosa Sepal.Length 4.9
## 6 setosa Sepal.Width 3
Long to wide:
Indometh %>%
pivot_wider(
names_from = time,
values_from = conc, names_prefix = "timepoint_")## # A tibble: 6 × 12
## Subject timepoint_0.25 timep…¹ timep…² timep…³ timep…⁴ timep…⁵ timep…⁶ timep…⁷
## <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1.5 0.94 0.78 0.48 0.37 0.19 0.12 0.11
## 2 2 2.03 1.63 0.71 0.7 0.64 0.36 0.32 0.2
## 3 3 2.72 1.49 1.16 0.8 0.8 0.39 0.22 0.12
## 4 4 1.85 1.39 1.02 0.89 0.59 0.4 0.16 0.11
## 5 5 2.05 1.04 0.81 0.39 0.3 0.23 0.13 0.11
## 6 6 2.31 1.44 1.03 0.84 0.64 0.42 0.24 0.17
## # … with 3 more variables: timepoint_5 <dbl>, timepoint_6 <dbl>,
## # timepoint_8 <dbl>, and abbreviated variable names ¹timepoint_0.5,
## # ²timepoint_0.75, ³timepoint_1, ⁴timepoint_1.25, ⁵timepoint_2, ⁶timepoint_3,
## # ⁷timepoint_4
# first reshape iris into long format, but now with a row-ID
iris.2 = iris
iris.2$id = 1:nrow(iris.2)
iris.long = iris.2 %>%
pivot_longer(
cols = c(Sepal.Length:Petal.Width),
names_to = "flower.part", values_to = "value")
# long to wide
iris.long %>%
pivot_wider(
names_from = flower.part,
values_from = value, id_cols = c(id, Species)) %>%
head## # A tibble: 6 × 6
## id Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## <int> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 setosa 5.1 3.5 1.4 0.2
## 2 2 setosa 4.9 3 1.4 0.2
## 3 3 setosa 4.7 3.2 1.3 0.2
## 4 4 setosa 4.6 3.1 1.5 0.2
## 5 5 setosa 5 3.6 1.4 0.2
## 6 6 setosa 5.4 3.9 1.7 0.4
1.3 Showcasing tables
1.3.1 knitr & kableExtra
esoph %>%
head() %>%
knitr::kable() %>%
kable_styling() %>%
print()| agegp | alcgp | tobgp | ncases | ncontrols |
|---|---|---|---|---|
| 25-34 | 0-39g/day | 0-9g/day | 0 | 40 |
| 25-34 | 0-39g/day | 10-19 | 0 | 10 |
| 25-34 | 0-39g/day | 20-29 | 0 | 6 |
| 25-34 | 0-39g/day | 30+ | 0 | 5 |
| 25-34 | 40-79 | 0-9g/day | 0 | 27 |
| 25-34 | 40-79 | 10-19 | 0 | 7 |
header_vec = c(1,4,1) # +1 because we also display rownames
names(header_vec) = c(" ", "Flower parts", " ")
iris %>%
head() %>%
knitr::kable(row.names=T, align = "c") %>%
kable_styling(full_width = F) %>%
add_header_above(header_vec) %>%
print()| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
|---|---|---|---|---|---|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
1.3.2 DT
iris %>%
datatable(options = list(pageLength = 10))2 Visualizing data
There is a huge number of ways by which data visualization can be carried out, even within R: from the most basic functions such as “plot()”, “boxplot()”, and “hist()”, to very complex plotting packages, like plotly. One of the most flexible, robust, and widely used among the graphical R-packages is ggplot2 (and its numerous complementing packages to improve and expand its functionality).
“ggplot2 has an underlying grammar, based on the Grammar of Graphics (Wilkinson, 2005), that allows you to compose graphs by combining independent components. This makes ggplot2 powerful. Rather than being limited to sets of pre-defined graphics, you can create novel graphics that are tailored to your specific problem.”
The main idea is to provide layers of visualization and the grammar is following the “logic of layers”, as each component or modification to the figure we’d like to create comes from flexible but specialized functions. All ggplot2 figures are built from two components: the data and a mapping. For the latter, additional elements include the layer (with geometric and statistical information), scale, coordinate system, faceting, and theme.
2.1 Basic plots
# scatterplot
p1 = ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point()
p1# boxplot
p2 = ggplot(iris, aes(x = Species, y = Sepal.Width)) +
geom_boxplot()
p2# barplot
df.1 = as.data.frame(table(esoph$agegp))
p3 = ggplot(df.1, aes(x = Var1, y = Freq)) +
geom_bar(stat = "identity")
p3# or, alternatively for the above barplot:
p3.2 = ggplot(esoph, aes(x = agegp, y = after_stat(count))) +
geom_bar()
p3.2# histogram
p4 = ggplot(Theoph, aes(x = conc)) +
geom_histogram(bins = 50)
p4# density plot
p4.2 = ggplot(Theoph, aes(x = conc)) +
geom_density()
p4.22.2 Shaping plots
There is a large number of pre-defined colors in R: R color chart. Also, on scatter plots and line plots the shape of visualzing elements can be altered as well, using pre-defined plot symbol shapes and line types.
p5 = ggplot(esoph, aes(ncases)) +
geom_bar(aes(fill=agegp))
p5p6 = ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species, shape = Species)) +
geom_point(size = 3, alpha = 0.5) +
stat_ellipse(type="t") +
scale_color_manual(values = c(setosa = "cyan", versicolor = "red", virginica = "green")) +
scale_shape_manual(values = c(setosa = 15, versicolor = 16, virginica = 17)) +
xlab("Sepal length") +
ylab("Sepal width") +
ggtitle("iris data") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
p6# example using the long format of iris:
p6.2 = ggplot(iris.long, aes(value, fill=Species)) +
geom_density(alpha = 0.5) +
facet_wrap(~flower.part, scales = "free") +
theme_bw()
p6.2p7 = ggplot(Theoph, aes(x = Time, y = conc, color = Subject, linetype = Subject)) +
geom_line() +
geom_point() +
theme_bw()
p7p8 = ggplot(iris, aes(x = Species, y = Sepal.Width, color = Species)) +
geom_boxplot() +
geom_jitter(size = 3, alpha = 0.9) +
xlab("spp.") +
ylab("Sepal width") +
coord_flip() +
theme_bw()
p8# with denoting of significant group differences, using the R-package "ggpubr"
# also note that in this case, not "color" is used in "aes()", but "fill"!
# (Note: "stat_compare_means" uses simple t-test by default, so should only be used
# on data with normal distribution.)
## "method" can be either one of these:
## t.test, wilcox.test, kruskal.test, anova
comparison.list = list(
c("setosa", "versicolor"),
c("setosa", "virginica"),
c("versicolor", "virginica")
)
p8.2 = ggplot(iris, aes(x = Species, y = Sepal.Width, fill = Species)) +
geom_boxplot(alpha = 0.5) +
xlab("spp.") +
ylab("Sepal width") +
theme_bw() +
stat_compare_means(
method = "t.test",
comparisons = comparison.list,
label = "p.signif",
symnum.args = list(cutpoints = c(0, 0.001, 0.01, 0.05, 1),
symbols = c("***", "**", "*", "ns"))
)
p8.2# alternatively, with P-values:
p8.3 = ggplot(iris, aes(x = Species, y = Sepal.Width, fill = Species)) +
geom_boxplot(alpha = 0.5) +
xlab("spp.") +
ylab("Sepal width") +
theme_bw() +
stat_compare_means(
comparisons = comparison.list,
label = "p.format"
)
p8.3p9 = ggplot(iris, aes(x = Species, y = Sepal.Width, color = Species)) +
geom_violin(draw_quantiles = c(0.5), fill = NA) +
geom_jitter(size = 3, alpha = 0.5) +
xlab("spp.") +
ylab("Sepal width") +
coord_flip() +
theme_bw()
p92.3 Facets by groups
p7.2 = ggplot(Theoph, aes(x = Time, y = conc, color = Subject, linetype = Subject)) +
geom_line() +
geom_point() +
facet_wrap(~Dose) +
theme_bw()
p7.2esoph = esoph %>%
mutate(rate = ncases/sum(ncases+ncontrols))
esoph$line.group = paste(esoph$tobgp, esoph$alcgp, sep="__")
p7.3 = ggplot(esoph, aes(x = agegp, y = rate, color = agegp)) +
geom_point(size = 3) +
facet_grid(tobgp~alcgp) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
geom_line(
inherit.aes = F, data = esoph,
mapping = aes(x = agegp, y = rate, color = agegp, group = line.group))
p7.32.4 geom_smooth
p10 = ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point(size = 3, alpha = 0.5) +
geom_smooth(method = "lm", formula = y~x, se = T) +
xlab("Sepal length") +
ylab("Sepal width") +
theme_bw()
p10p11 = ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point(size = 3, alpha = 0.5) +
geom_smooth(method = "loess", formula = y~x, se = F) +
xlab("Sepal length") +
ylab("Sepal width") +
theme_bw()
p11p12 = ggplot(Theoph, aes(x = Time, y = conc, color = Subject)) +
geom_point(size = 3, alpha = 0.5) +
geom_smooth(data = Theoph, mapping = aes(x = Time, y = conc),
formula = y~x, method = "loess", inherit.aes = F) +
theme_bw()
p122.5 Some more…
require(ggalt)
p6.2 = ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species, shape = Species)) +
geom_point(size = 3, alpha = 0.5) +
geom_encircle() +
scale_shape_manual(values = c(setosa = 15, versicolor = 16, virginica = 17)) +
xlab("Sepal length") +
ylab("Sepal width") +
ggtitle("iris data") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
p6.2require(GGally)
ggpairs(iris, mapping = aes(color = Species, alpha=0.5)) +
theme_bw()require(Hmisc)
Cmat = rcorr(as.matrix(iris[,1:4]))
Cmat$P[is.na(Cmat$P)] = 0
ggcorrplot(
corr = Cmat$r, p.mat = Cmat$P,
method = "circle", hc.order = TRUE,
sig.level = 0.05) +
theme(plot.title = element_text(hjust = 0.5))require(ggExtra)
p13 = ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point(size = 3, alpha = 0.5) +
geom_smooth(method = "lm", formula = y~x, se = T) +
xlab("Sepal length") +
ylab("Sepal width") +
theme_bw()
ggMarginal(p13, type = "density", groupColour = T, fill = "transparent")ggMarginal(p13, type = "boxplot", groupColour = T)2.6 Arranging multiple plots
A useful package, which can help to create publication-ready composite figures: patchwork.
require(patchwork)
p1 + p2(p10 | p9) / p4p5 = p5 + theme_bw()
p7 = p7 + guides(color="none", linetype="none")
p12 = p12 + guides(color="none")
p7.2 / (p5 + p7 + p12)2.7 Export plots
# save latest printed plot
ggsave(
filename = "plot_p7.png",
path = "path/to/folder",
device = "png",
dpi = 500,
width = 5,
height = 5,
units = "in"
)
# save one specific plot
ggsave(
plot = p7,
filename = "plot_p7.png",
path = "path/to/folder",
device = "png",
dpi = 500,
width = 5,
height = 5,
units = "in"
)3 Practice
The practicing task will be simply to use the iris data table, and use some of the data manipulation methods from above, then create a simple docx report file.
- Print out the mean values of
Sepal.Lengthseparately for each species fromiris, using thegroup_by()andsummarize()functions! - Using the presented techniques, create a data table named
iris.r1in which onlyversicolorspecies are retained! - In this new
iris.r1create a new variable calledPetal.mean, which contains the mean ofPetal.LengthandPetal.Width! - Arrange the observations within the
iris.r1by sepal length, and print the first 6 rows! - Create the long format version of
iris.r1and save it in a new variable asiris.r1.long! - Create a html report file, in which:
- code chunks are not shown;
- warnings and messages are not shown from the code chunks;
- there is a header with your name;
- there is a simple text block with the Description
from the help of the
irisdata table (which help can be accessed with?iris); - the head of the
iris.r1.longtable is printed out, using thekintrandkableExtrapackages; - there is a boxplot enhanced with jitter (
geom_jitter) to visualize flower part values fromiris.r1.long, in a way that (i) flower parts are shown on the x-axis (horizontal axis) and measurement values are shown on the y-axis (vertical axis), and (ii) coloration is done based on flower parts!
- Create a docx report, which only includes the header, the description text, and the boxplot!